Abstract

We show how to train the fast dependency parser of Smith and Eisner (2008) for
improved accuracy. This parser can consider higher-order interactions among edges
while retaining $O(n^3)$ runtime. It outputs the parse with maximum expected
recall, but for speed, this expectation is taken under a posterior distribution
that is constructed only approximately, using loopy belief propagation through
structured factors. We show how to adjust the model parameters to compensate for
the errors introduced by this approximation, by following the gradient of the
actual loss on training data. We find this gradient by backpropagation. That is,
we treat the entire parser (approximations and all) as a differentiable circuit,
as Stoyanov et al. (2011) and Domke (2010) did for loopy CRFs. The resulting
trained parser obtains higher accuracy with fewer iterations of belief
propagation than one trained by conditional log-likelihood.

1 Introduction 

Recent improvements to dependency parsing accuracy have been driven by
higher-order features. Such a feature can look beyond just the parent and child
words connected by a single edge to also consider siblings, grandparents, etc. By
including increasingly global information, these features provide more
information for the parser, but they also complicate inference. The resulting
higher-order parsers depend on approximate inference and decoding procedures,
which may prevent them from predicting the best parse.

For example, consider the dependency parser we will train in this paper, which is
based on the work of Smith and Eisner (2008). Ostensibly, this parser finds the
minimum Bayes risk (MBR) parse under a probability distribution defined by a
higher-order dependency parsing model. In reality, however, it achieves
$O(n^3 T)$ runtime by relying on three approximations during inference: (1)
variational inference by loopy belief propagation (BP) on a factor graph, (2)
early stopping of inference after $t_{\max}$ iterations prior to convergence, and
(3) a first-order pruning model to limit the number of edges considered in the
higher-order model. Such parsers are traditionally trained as if the inference
had been exact (Smith and Eisner, 2008). 1

In contrast, we train the parser such that the approximate system performs well
on the final evaluation function. Stoyanov and Eisner (2012) call this approach
ERMA, for “empirical risk minimization under approximations.” We treat the entire
parsing computation as a differentiable circuit, and backpropagate the evaluation
function through our approximate inference and decoding methods to improve its
parameters by gradient descent.

Our primary contribution is the application of Stoyanov and Eisner’s learning
method in the parsing setting, for which the graphical model involves a global
constraint. Smith and Eisner (2008) previously showed how to run BP in this
setting (by calling the inside-outside algorithm as a subroutine). We must
backpropagate the downstream objective function through their algorithm so that
we can follow its gradient. We carefully define our objective function to be
smooth and differentiable, yet equivalent to accuracy of the minimum Bayes risk
(MBR) parse in the limit. Further, we introduce a new simpler objective function
based on the $L_2$ distance between the approximate marginals and the “true”
marginals from the gold data.

1 For perceptron training, utilizing inexact inference as a drop-in replacement
for exact inference can badly mislead the learner (Kulesza and Pereira, 2008).

Figure 1: Factor graph for dependency parsing of a 4-word sentence; the special
node <ROOT> is the root of the dependency graph. In this figure, the boolean
variable $y_{h,m}$ encodes whether the edge from parent $h$ to child $m$ is
present. The unary factor (black) connected to this variable scores the edge in
isolation (given the sentence). The PTree factor (red) coordinates all variables
to ensure that the edges form a tree. The drawing shows a few of the higher-order
factors (purple factors for grandparents, green factors for arbitrary siblings);
these are responsible for the graph being cyclic (“loopy”).

The goal of this work is to account for the approximations made by a system
rooted in structured belief propagation. Taking such approximations into account
during training enables us to improve the speed and accuracy of inference at test
time. To this end, we compare our training method with the standard approach of
conditional log-likelihood. We evaluate our parser on 19 languages from the
CoNLL-2006 (Buchholz and Marsi, 2006) and CoNLL-2007 (Nivre et al., 2007) Shared
Tasks as well as the English Penn Treebank (Marcus et al., 1993). On English, the
resulting parser obtains higher accuracy with fewer iterations of BP than
standard conditional log-likelihood (CLL) training. On the CoNLL languages, we
find that on average it yields higher accuracy parsers than CLL training,
particularly when limited to few BP iterations.

2 Dependency Parsing by Belief Propagation

This section describes the parser that we will train. 


Model A factor graph (Frey et al., 1997; Kschischang et al., 2001) is a bipartite
graph between factors $\alpha$ and variables $y_i$, and defines the factorization
of a probability distribution over a set of variables $\{y_1, y_2, \ldots\}$. The
factor graph contains edges between each factor $\alpha$ and a subset of
variables $y_\alpha$. Each factor has a local opinion about the possible
assignments to its neighboring variables. Such opinions are given by the factor’s
potential function $\psi_\alpha$, which assigns a nonnegative score to each
configuration of a subset of variables $y_\alpha$. We define the probability of a
given assignment $y$ to be proportional to a product of potential functions:
$p(y) = \frac{1}{Z} \prod_{\alpha} \psi_\alpha(y_\alpha)$, where $Z$ is the
normalizing constant.

Smith and Eisner (2008) define a factor graph for dependency parsing of a given
$n$-word sentence: $n^2$ binary variables $\{y_1, y_2, \ldots\}$ indicate which
of the directed arcs are included ($y_i = \text{ON}$) or excluded
($y_i = \text{OFF}$) in the dependency parse. One of the factors plays the role
of a hard global constraint: $\psi_{\text{PTree}}(y)$ is 1 or 0 according to
whether the assignment encodes a projective dependency tree. Another $O(n^2)$
factors (one per variable) evaluate the individual arcs given the sentence, so
that $p(y)$ describes a first-order dependency parser. A higher-order parsing
model is achieved by including higher-order factors, each scoring configurations
of two or more arcs, such as grandparent and sibling configurations. Higher-order
factors add cycles to the factor graph. See Figure 1 for an example factor graph.

We define each potential function to have a log-linear form:
$\psi_\alpha(y_\alpha) = \exp(\theta \cdot f_\alpha(y_\alpha, x))$. Here $x$ is
the vector of observed variables such as the sentence and its POS tags;
$f_\alpha$ extracts a vector of features; and $\theta$ is our vector of model
parameters. We write the resulting probability distribution over parses as
$p_\theta(y)$, to indicate that it depends on $\theta$.

Loss For dependency parsing, our loss function is the number of missing edges in
the predicted parse $\hat{y}$, relative to the reference (or “gold”) parse $y^*$:

$$\ell(\hat{y}, y^*) = \sum_{i:\, y_i^* = \text{ON}} \mathbb{I}(\hat{y}_i = \text{OFF}) \tag{1}$$

Because $\hat{y}$ and $y^*$ each specify exactly one parent for each word token,
$\ell(\hat{y}, y^*)$ equals the number of word tokens whose parent is predicted
incorrectly, that is, directed dependency error.
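As a small illustration of this loss, the sketch below counts mispredicted parents for a toy sentence. The head-array encoding (one head index per token, with -1 standing for <ROOT>) and the function name are assumptions made for the example; they are not taken from the paper.

```python
def directed_dependency_error(pred_heads, gold_heads):
    """Number of tokens whose predicted parent differs from the gold parent.

    Each list holds, for token m, the index of its head; -1 stands for
    <ROOT>. This encoding is a hypothetical choice for illustration.
    """
    if len(pred_heads) != len(gold_heads):
        raise ValueError("both parses must cover the same tokens")
    # A gold edge (h, m) is missing exactly when token m got another head.
    return sum(1 for p, g in zip(pred_heads, gold_heads) if p != g)


gold = [1, -1, 1, 2]   # token 3 attaches to token 2
pred = [1, -1, 1, 1]   # token 3 wrongly attaches to token 1
print(directed_dependency_error(pred, gold))
```

Because every token has exactly one parent in both parses, counting missing gold edges and counting wrong parents give the same number.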










Decoder To obtain a single parse as output, we use a minimum Bayes risk (MBR)
decoder, which attempts to find the tree with minimum expected loss under the
model’s distribution (Bickel and Doksum, 1977). For our directed dependency error
loss function, we obtain the following decision rule:

$$h_\theta(x) = \operatorname*{argmin}_{\hat{y}} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[\ell(\hat{y}, y)\right] \tag{2}$$

$$\phantom{h_\theta(x)} = \operatorname*{argmax}_{\hat{y}} \sum_{i:\, \hat{y}_i = \text{ON}} p_\theta(y_i = \text{ON} \mid x) \tag{3}$$

Here $\hat{y}$ ranges over well-formed parses. Thus, our parser seeks a
well-formed parse $h_\theta(x)$ whose individual edges have a high probability of
being correct according to $p_\theta$. MBR is the principled way to take a loss
function into account under a probabilistic model. By contrast, maximum a
posteriori (MAP) decoding does not consider the loss function. It would return
the single highest-probability parse even if that parse, and its individual
edges, were unlikely to be correct. 2
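The contrast between MBR and MAP decoding can be seen on a toy distribution. The sketch below enumerates three candidate parses explicitly (an illustrative stand-in for the full set of well-formed parses), computes edge marginals, and applies rule (3). The probabilities and all names are made up for the example.

```python
def edge_marginals(tree_probs):
    """Sum the probability of every tree that contains each edge."""
    marginals = {}
    for tree, prob in tree_probs.items():
        for edge in tree:
            marginals[edge] = marginals.get(edge, 0.0) + prob
    return marginals


def mbr_decode(tree_probs):
    """Rule (3): pick the tree whose edges have the largest total marginal."""
    marginals = edge_marginals(tree_probs)
    return max(tree_probs, key=lambda tree: sum(marginals[e] for e in tree))


def map_decode(tree_probs):
    """Pick the single most probable tree, ignoring the loss function."""
    return max(tree_probs, key=tree_probs.get)


# Edges are (head, child) pairs; node 0 plays the role of <ROOT>.
tree_a = frozenset({(0, 1), (1, 2)})
tree_b = frozenset({(0, 2), (2, 1)})
tree_c = frozenset({(0, 1), (0, 2)})
probs = {tree_a: 0.36, tree_b: 0.34, tree_c: 0.30}

print(sorted(map_decode(probs)))   # tree_a: highest probability as a whole
print(sorted(mbr_decode(probs)))   # tree_c: its edges are individually likeliest
```

Here the MAP parse contains an edge with marginal 0.36, while both edges of the MBR parse have marginals above 0.6, which is the situation the paragraph above describes.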

All systems in this paper use MBR decoding to consider the loss function at test
time. This implies that the ideal training procedure would be to find the true
$p_\theta$ so that its marginals can be used in (3). Our baseline system attempts
this. In practice, however, we will not be able to find the true $p_\theta$
(model misspecification) nor exactly compute the marginals of $p_\theta$
(computational intractability). Thus, this paper proposes a training procedure
that compensates for the system’s approximations, adjusting $\theta$ to reduce
the actual loss of $h_\theta(x)$ as measured at training time.

To find the MBR parse, we first run inference to compute the marginal probability
$p_\theta(y_i = \text{ON})$ for each edge. Then we maximize (3) by running a
first-order dependency parser with edge scores equal to those probabilities. 3
When our inference algorithm is approximate, we replace the exact marginal with
its approximation, the normalized belief from BP, given by $b_i(\text{ON})$ in
(6) below.

2 If we used a simple 0-1 loss function within (2), then MBR decoding would
reduce to MAP decoding.

3 Prior work (Smith and Eisner, 2008; Bansal et al., 2014) used the log-odds
ratio $\log \frac{b_i(\text{ON})}{b_i(\text{OFF})}$ as the edge scores for
decoding, but this yields a parse different from the MBR parse.

Inference Loopy belief propagation (BP) (Murphy et al., 1999) computes
approximations to the variable marginals $p_\theta(y_i)$ and the factor marginals
$p_\theta(y_\alpha)$. The algorithm proceeds by iteratively sending messages from
variables, $y_i$, to factors, $\alpha$:

$$m^{(t)}_{i \to \alpha}(y_i) = \prod_{\beta \in \mathcal{N}(i) \setminus \alpha} m^{(t-1)}_{\beta \to i}(y_i) \tag{4}$$

and from factors to variables:

$$m^{(t)}_{\alpha \to i}(y_i) = \sum_{y_\alpha \sim y_i} \psi_\alpha(y_\alpha) \prod_{j \in \mathcal{N}(\alpha) \setminus i} m^{(t-1)}_{j \to \alpha}(y_j) \tag{5}$$

where $\mathcal{N}(i)$ and $\mathcal{N}(\alpha)$ denote the neighbors of $y_i$
and $\alpha$ respectively, and where $y_\alpha \sim y_i$ is standard notation to
indicate that $y_\alpha$ ranges over all assignments to the variables
participating in the factor $\alpha$ provided that the $i$th variable has value
$y_i$. Note that the messages at time $t$ are computed from those at time
$(t - 1)$. Messages at the final time $t_{\max}$ are used to compute the beliefs
at each factor and variable:


$$b_i(y_i) = \prod_{\alpha \in \mathcal{N}(i)} m^{(t_{\max})}_{\alpha \to i}(y_i) \tag{6}$$

$$b_\alpha(y_\alpha) = \psi_\alpha(y_\alpha) \prod_{i \in \mathcal{N}(\alpha)} m^{(t_{\max})}_{i \to \alpha}(y_i) \tag{7}$$

Each of the functions defined by equations (4)-(7) can be optionally rescaled by
a constant at any time, e.g., to prevent overflow/underflow. Below, we
specifically assume that each function $b_i$ has been rescaled such that
$\sum_{y_i} b_i(y_i) = 1$. This $b_i$ approximates the marginal distribution over
$y_i$ values.
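To make the message updates concrete, here is a minimal sketch of parallel BP over binary variables with explicitly tabulated factors, following Eqs. (4)-(6) and rescaling every message to sum to 1. It is only an illustration under simplifying assumptions: it has no structured PTree factor, the factor representation (a scope tuple plus a score table) is hypothetical, and each variable is assumed to appear at most once in a factor’s scope.

```python
import itertools


def normalize(msg):
    """Rescale a two-valued message so that it sums to 1."""
    total = sum(msg)
    return [v / total for v in msg]


def run_bp(num_vars, factors, num_iters):
    """Parallel loopy BP; returns normalized beliefs [b_i(0), b_i(1)].

    factors: list of (scope, table); scope is a tuple of variable indices
    and table maps each assignment tuple to a nonnegative score.
    """
    links = [(a, i) for a, (scope, _) in enumerate(factors) for i in scope]
    var_to_fac = {link: [0.5, 0.5] for link in links}   # m_{i -> a}
    fac_to_var = {link: [0.5, 0.5] for link in links}   # m_{a -> i}

    for _ in range(num_iters):
        new_v2f, new_f2v = {}, {}
        for a, i in links:
            # Eq. (4): product of messages from the other factors at i.
            msg = [1.0, 1.0]
            for b, j in links:
                if j == i and b != a:
                    msg = [msg[v] * fac_to_var[(b, j)][v] for v in (0, 1)]
            new_v2f[(a, i)] = normalize(msg)
            # Eq. (5): sum over factor assignments, grouped by the value of y_i.
            scope, table = factors[a]
            msg = [0.0, 0.0]
            for assign in itertools.product((0, 1), repeat=len(scope)):
                score = table[assign]
                for j, y_j in zip(scope, assign):
                    if j != i:
                        score *= var_to_fac[(a, j)][y_j]
                msg[assign[scope.index(i)]] += score
            new_f2v[(a, i)] = normalize(msg)
        var_to_fac, fac_to_var = new_v2f, new_f2v

    beliefs = []
    for i in range(num_vars):
        # Eq. (6): product of all incoming factor-to-variable messages.
        belief = [1.0, 1.0]
        for a, j in links:
            if j == i:
                belief = [belief[v] * fac_to_var[(a, j)][v] for v in (0, 1)]
        beliefs.append(normalize(belief))
    return beliefs


# A two-variable chain: one unary factor and one pairwise factor.
toy_factors = [
    ((0,), {(0,): 1.0, (1,): 3.0}),
    ((0, 1), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}),
]
print(run_bp(2, toy_factors, num_iters=5))
```

On this acyclic toy graph the beliefs agree with the exact marginals (15/18 and 13/18 for the value 1); on a cyclic graph they would only be approximations.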

Messages continue to change indefinitely if the 
factor graph is cyclic, but in the limit, the rescaled 
messages may converge. Although the equations 
above update all messages in parallel, convergence 
is much faster if only one message is updated per 
timestep, in some well-chosen serial order. 4 

For the PTree factor, the summation over variable assignments required for
$m^{(t)}_{\alpha \to i}(y_i)$ in Eq. (5) equates to a summation over
exponentially many projective parse trees. However, we can use an inside-outside
variant of the algorithm of Eisner (1996) to compute this in polynomial time (we
describe this as hypergraph parsing in § 3). The resulting “structured BP”
inference procedure is exact for first-order dependency parsing, and approximate
when high-order factors are incorporated. The advantage of BP is that it enables
fast approximate inference when exact inference is too slow. See Smith and Eisner
(2008) for details. 5

4 Following Dreyer and Eisner (2009, footnote 22), we choose an arbitrary
directed spanning tree rooted at the PTree factor. We visit the nodes in
topologically sorted order (starting at the leaves) and update any message from
the node being visited to a node that is later in the order (e.g., closer to the
root). We then reverse this order and repeat, so that every message has been
passed once. This constitutes one iteration of BP.

3 Approximation-aware Learning 

We aim to find the parameters $\theta^*$ that minimize a regularized objective
function over the training sample of sentence/parse pairs
$\{(x^{(d)}, y^{(d)})\}_{d=1}^{D}$.

$$\theta^* = \operatorname*{argmin}_{\theta} \; \frac{\lambda}{2} \|\theta\|_2^2 + \frac{1}{D} \sum_{d=1}^{D} J(\theta; x^{(d)}, y^{(d)}) \tag{8}$$

where $\lambda > 0$ is the regularization coefficient and $J(\theta; x, y)$ is a
given differentiable function, possibly nonconvex. We locally minimize this
objective using $\ell_2$-regularized AdaGrad with Composite Mirror Descent (Duchi
et al., 2011), a variant of stochastic gradient descent that uses mini-batches,
an adaptive learning rate per dimension, and sparse lazy updates from the
regularizer. 6

5 How slow is exact inference for dependency parsing? For certain choices of
higher-order factors, polynomial time is possible via dynamic programming
(McDonald et al., 2005; Carreras, 2007; Koo and Collins, 2010). However, BP will
typically be asymptotically faster (for a fixed number of iterations) and faster
in practice. In some other settings, exact inference is NP-hard. In particular,
non-projective parsing becomes NP-hard with even second-order factors (McDonald
and Pereira, 2006). BP can handle this case in polynomial time by replacing the
PTree factor with a Tree factor that allows edges to cross.

6 $\theta$ is initialized to 0 when not otherwise specified.

Objective Functions As in Stoyanov et al. (2011), our aim is to minimize expected
loss on the true data distribution over sentence/parse pairs $(X, Y)$:

$$\theta^* = \operatorname*{argmin}_{\theta} \; \mathbb{E}\left[\ell(h_\theta(X), Y)\right] \tag{9}$$

Since the true data distribution is unknown, we substitute the expected loss over
the training sample, and regularize our objective to reduce sampling variance.
Specifically, we aim to minimize the regularized empirical risk, given by (8)
with $J(\theta; x^{(d)}, y^{(d)})$ set to $\ell(h_\theta(x^{(d)}), y^{(d)})$.
Using our MBR decoder $h_\theta$ in (3), this loss function would not be
differentiable because of the argmax in the definition of $h_\theta$ (3). We will
address this below by substituting a differentiable softmax. This is the “ERMA”
method of Stoyanov and Eisner (2012). We will also consider simpler choices of
$J(\theta; x^{(d)}, y^{(d)})$ that are commonly used in training neural networks.
Finally, the standard convex objective is conditional log-likelihood (§ 4).

Gradient Computation To compute the gradient $\nabla_\theta J(\theta; x, y^*)$ of
the loss on a single sentence $(x, y^*) = (x^{(d)}, y^{(d)})$, we apply automatic
differentiation (AD) in the reverse mode (Griewank and Corliss, 1991). This
yields the same type of “backpropagation” algorithm that has long been used for
training neural networks (Rumelhart et al., 1986). In effect, we are regarding
(say) 5 iterations of the BP algorithm on sentence $x$, followed by (softened)
MBR decoding and comparison to the target output $y^*$, as a kind of neural
network that computes $\ell(h_\theta(x), y^*)$. It is important to note that the
resulting gradient computation algorithm is exact up to floating-point error, and
has the same asymptotic complexity as the original decoding algorithm, requiring
only about twice the computation. The AD method applies provided that the
original function is indeed differentiable with respect to $\theta$, an issue
that we take up below.

In principle, it is possible to compute the gradient with minimal additional
coding. There exists AD software (some listed at autodiff.org) that could be used
to derive the necessary code automatically. Another option would be to use the
perturbation method of Domke (2010). However, we implemented the gradient
computation directly, and we describe it here.

3.1 Inference, Decoding, and Loss as a Feedforward Circuit

The backpropagation algorithm is often applied to neural networks, where the
topology of a feedforward circuit is statically specified and can be applied to
any input. Our BP algorithm, decoder, and loss function similarly define a
feedforward circuit that computes our function $J$. However, the circuit’s
topology is defined dynamically (per sentence $x^{(d)}$) by “unrolling” the
computation into a graph.

Figure 2 shows this topology for one choice of objective function. The high level
modules consist of (A) computing potential functions, (B) initializing messages,
(C) sending messages, (D) computing beliefs, and (E) decoding and computing the
loss. We zoom in on two submodules: the first computes messages from the PTree
factor efficiently (C.1-C.3); the second computes a softened version of our loss
function (E.1-E.3). Both of these submodules are made efficient by the
inside-outside algorithm.

The remainder of this section describes additional details of how we define the
function $J$ (the forward pass) and how we compute its gradient (the backward
pass). Backpropagation computes the derivative of any given function specified by
an arbitrary circuit consisting of elementary differentiable operations (e.g.
$+$, $-$, $\times$, $\div$, log, exp). This is accomplished by repeated
application of the chain rule.

Backpropagating through an algorithm proceeds by similar application of the chain
rule, where the intermediate quantities are determined by the topology of the
circuit. Doing so with the circuit from Figure 2 poses several challenges. Eaton
and Ghahramani (2009) and Stoyanov et al. (2011) showed how to backpropagate
through the basic BP algorithm, and we reiterate the key details below (§ 3.3).
The remaining challenges form the primary technical contribution of this paper:

1. Our true loss function $\ell(h_\theta(x), y^*)$ by way of the decoder (3)
contains an argmax over trees and is therefore not differentiable. We show how to
soften this decoder, making it differentiable (§ 3.2).

2. Empirically, we find the above objective difficult to optimize. To address
this, we substitute a simpler $L_2$ loss function (commonly used in neural
networks). This is easier to optimize and yields our best parsers in practice
(§ 3.2).

3. We show how to run backprop through the inside-outside algorithm on a
hypergraph (§ 3.5), and thereby on the softened decoder and computation of
messages from the PTree factor. This allows us to go beyond Stoyanov et al.
(2011) and train structured BP in an approximation-aware and loss-aware fashion.



Figure 2: Feed-forward topology of inference, decoding, and loss. (E) shows the
annealed risk, one of the objective functions we consider.

3.2 Differentiable Objective Functions 

Annealed Risk Directed dependency error, $\ell(h_\theta(x), y^*)$, is not
differentiable due to the argmax in the decoder $h_\theta$. We therefore redefine
$J(\theta; x, y^*)$ to be a new differentiable loss function, the annealed risk
$R^{1/T}_\theta(x, y^*)$, which approaches the loss $\ell(h_\theta(x), y^*)$ as
the temperature $T \to 0$.

This is done by replacing our non-differentiable decoder $h_\theta$ with a
differentiable one (at training time). As input, it still takes the marginals
$p_\theta(y_i = \text{ON} \mid x)$, or in practice, their BP approximations
$b_i(\text{ON})$. We define a distribution over parse trees:

$$q^{1/T}_\theta(\hat{y}) \propto \exp\Big( \sum_{i:\, \hat{y}_i = \text{ON}} p_\theta(y_i = \text{ON} \mid x) / T \Big) \tag{10}$$

Imagine that at training time, our decoder stochastically returns a parse
$\hat{y}$ sampled from this distribution. Our risk is the expected loss of that
decoder:

$$R^{1/T}_\theta(x, y^*) = \mathbb{E}_{\hat{y} \sim q^{1/T}_\theta}\left[\ell(\hat{y}, y^*)\right] \tag{11}$$

As $T \to 0$ (“annealing”), the decoder almost always chooses the MBR parse, 7 so
our risk approaches the loss of the actual MBR decoder that will be used at test
time. However, as a function of $\theta$, it remains differentiable (though not
convex) for any $T > 0$.

To compute the annealed risk, observe that it simplifies (up to an additive
constant, the number of gold edges) to
$R^{1/T}_\theta(x, y^*) = -\sum_{i:\, y_i^* = \text{ON}} q^{1/T}_\theta(\hat{y}_i = \text{ON})$.
This is the negated expected recall of a parse $\hat{y} \sim q^{1/T}_\theta$. We
obtain the required marginals $q^{1/T}_\theta(\hat{y}_i = \text{ON})$ from (10)
by running inside-outside where the edge weight for edge $i$ is given by
$\exp(p_\theta(y_i = \text{ON} \mid x)/T)$.

With the annealed risk as our $J$ function, we can compute $\nabla_\theta J$ by
backpropagating through the computation in the previous paragraph. The
computations of the edge weights and the expected recall are trivially
differentiable. The only challenge is computing the partials of the marginals,
that is, differentiating the function computed by this call to the inside-outside
algorithm; we address this in Section 3.5. Figure 2 (E.1-E.3) shows where these
computations lie within the circuit.
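The behaviour of the annealed risk as the temperature changes can be checked on a toy problem. The sketch below evaluates Eq. (11) by brute force over an explicit list of candidate trees, which stands in for the inside-outside computation described above and is only feasible for tiny examples. The marginals, the candidate list, and the function name are illustrative assumptions.

```python
import math


def annealed_risk(edge_marginals, candidate_trees, gold_tree, temperature):
    """Expected number of missing gold edges under the distribution of Eq. (10).

    edge_marginals: dict mapping an edge (head, child) to p(y_i = ON | x).
    candidate_trees: explicit list of frozensets of edges.
    """
    scores = [sum(edge_marginals[e] for e in tree) / temperature
              for tree in candidate_trees]
    top = max(scores)                      # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    risk = 0.0
    for tree, weight in zip(candidate_trees, weights):
        missing = len(gold_tree - tree)    # gold edges that are OFF in tree
        risk += (weight / total) * missing
    return risk


marginals = {(0, 1): 0.7, (1, 2): 0.6, (0, 2): 0.4, (2, 1): 0.3}
tree_a = frozenset({(0, 1), (1, 2)})       # the MBR parse under these marginals
tree_b = frozenset({(0, 2), (2, 1)})
tree_c = frozenset({(0, 1), (0, 2)})       # taken to be the gold parse
for temp in (10.0, 1.0, 0.1, 0.01):
    print(temp, annealed_risk(marginals, [tree_a, tree_b, tree_c], tree_c, temp))
```

As the temperature shrinks, the printed risk moves toward 1, the loss of the MBR parse `tree_a` against the gold parse; at high temperature it is close to the average loss over the three candidates.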

Whether our test-time system computes the marginals of $p_\theta$ exactly or does
so approximately via BP, our new training objective approaches (as $T \to 0$) the
true empirical risk of the test-time parser that performs MBR decoding from the
computed marginals. Empirically, however, we will find that it is not the most
effective training objective (§ 5.2). Stoyanov et al. (2011) postulate that the
nonconvexity of empirical risk may make it a difficult function to optimize (even
with annealing). Our next two objectives provide alternatives.

7 Recall from (3) that the MBR parse is the tree $\hat{y}$ that maximizes the sum
$\sum_{i:\, \hat{y}_i = \text{ON}} p_\theta(y_i = \text{ON} \mid x)$. As
$T \to 0$, the right-hand side of (10) grows fastest for this $\hat{y}$, so its
probability under $q^{1/T}_\theta$ approaches 1 (or $1/k$ if there is a $k$-way
tie for MBR parse).

$L_2$ Distance We can view our inference, decoder, and loss as defining a form of
deep neural network, whose topology is inspired by our linguistic knowledge of
the problem (e.g., the edge variables should define a tree). This connection to
deep learning allows us to consider training methods akin to supervised
layer-wise training. We temporarily remove the top layers of our network (i.e.
the decoder and loss module, Fig. 2 (E)) so that the output layer of our “deep
network” consists of the normalized variable beliefs $b_i(y_i)$ from BP. We can
then define a supervised loss function directly on these beliefs.

We don’t have supervised data for this layer of beliefs, but we can create it
artificially. Use the supervised parse $y^*$ to define “target beliefs” by
$b_i^*(y_i) = \mathbb{I}(y_i = y_i^*) \in \{0, 1\}$. To find parameters $\theta$
that make BP’s beliefs close to these targets, we can minimize an $L_2$ distance
loss function:

$$J(\theta; x, y^*) = \sum_i \sum_{y_i} \left(b_i(y_i) - b_i^*(y_i)\right)^2 \tag{12}$$

We can use this $L_2$ distance objective function for training, adding the MBR
decoder and loss evaluation back in only at test time.
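For binary edge variables the objective in Eq. (12) reduces to a few lines of arithmetic. The sketch below assumes that beliefs are stored as a dictionary from an edge to its normalized belief of being ON; that storage format and the names are hypothetical.

```python
def l2_distance(beliefs_on, gold_edges):
    """Eq. (12) for binary variables with normalized beliefs.

    beliefs_on: dict mapping an edge to b_i(ON); b_i(OFF) is 1 - b_i(ON).
    gold_edges: set of edges that are ON in the gold parse.
    """
    total = 0.0
    for edge, b_on in beliefs_on.items():
        target_on = 1.0 if edge in gold_edges else 0.0
        total += (b_on - target_on) ** 2                    # term for y_i = ON
        total += ((1.0 - b_on) - (1.0 - target_on)) ** 2    # term for y_i = OFF
    return total


beliefs = {(0, 1): 0.9, (0, 2): 0.2}
print(l2_distance(beliefs, gold_edges={(0, 1)}))
```

Both terms for a given edge are equal when the beliefs are normalized, so each edge contributes twice its squared error on the ON value.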

Layer-wise Training Just as in layer-wise training of neural networks, we can
take a two-stage approach to training. First, we train to minimize the $L_2$
distance. Then, we use the resulting $\theta$ as initialization to optimize the
annealed risk, which does consider the decoder and loss function (i.e. the top
layers of Fig. 2). Stoyanov et al. (2011) found mean squared error (MSE) to give
a smoother training objective, though still non-convex, and similarly used it to
find an initializer for empirical risk. Though their variant of the $L_2$
objective did not completely dispense with the decoder as ours does, it is a
similar approach to our proposed layer-wise training.

3.3 Backpropagation through BP 

Belief propagation proceeds iteratively by sending messages. We can label each
message with a timestamp $t$ (e.g. $m^{(t)}_{i \to \alpha}$) indicating the time
step at which it was computed. Figure 2 (B) shows the messages at time $t = 0$,
denoted $m^{(0)}_{i \to \alpha}$, which are initialized to the uniform
distribution. Figure 2 (C) depicts the computation of all subsequent messages via
Eqs. (4) and (5). Messages at time $t$ are computed from messages at time
$t - 1$ or before and the potential functions $\psi_\alpha$. After the final
iteration $T$, the beliefs $b_i(y_i)$, $b_\alpha(y_\alpha)$ are computed from the
final messages $m^{(T)}_{i \to \alpha}$ using Eqs. (6) and (7); this is shown in
Figure 2 (D). Optionally, we can normalize the messages after each step to avoid
overflow (not shown in the figure) as well as the beliefs.

Except for the messages sent from the PTree factor, each step of BP computes some
value from earlier values using a simple formula. Backpropagation differentiates
these simple formulas. This lets it compute $J$’s partial derivatives with
respect to the earlier values, once its partial derivatives have been computed
with respect to later values. Explicit formulas can be found in the appendix of
Stoyanov et al. (2011).

3.4 BP and backpropagation with PTree 

The PTree factor has a special structure that we exploit for efficiency during
BP. Stoyanov et al. (2011) assume that BP takes an explicit sum in (5). For the
PTree factor, this equates to a sum over all projective dependency trees (since
$\psi_{\text{PTree}}(y) = 0$ for any assignment $y$ which is not a tree). There
are exponentially many such trees. However, Smith and Eisner (2008) point out
that for $\alpha = \text{PTree}$, the summation has a special structure that can
be exploited by dynamic programming.

To compute the factor-to-variable messages from $\alpha = \text{PTree}$, they
first run the inside-outside algorithm where the edge weights are given by the
ratios of the messages to PTree:
$\frac{m^{(t)}_{i \to \alpha}(\text{ON})}{m^{(t)}_{i \to \alpha}(\text{OFF})}$.
Then they multiply each resulting edge marginal given by inside-outside by the
product of all the OFF messages $\prod_i m^{(t)}_{i \to \alpha}(\text{OFF})$ to
get the marginal factor belief $b_\alpha(y_i)$. Finally they divide the belief by
the incoming message $m^{(t)}_{i \to \alpha}(\text{ON})$ to get the corresponding
outgoing message $m^{(t+1)}_{\alpha \to i}(\text{ON})$.

These steps are shown in Figure 2 (C.1-C.3), and are repeated each time we send a
message from the PTree factor. The derivatives of the message ratios and products
mentioned here are trivial. Though we focus here on projective dependency
parsing, our techniques are also applicable to non-projective parsing and the
Tree factor; we leave this to future work. In the next subsection, we explain how
to backpropagate through the inside-outside algorithm.

3.5 Backpropagation through Inside-Outside on a Hypergraph

Both the annealed risk loss function (§ 3.2) and the computation of messages from
the PTree factor use the inside-outside algorithm for dependency parsing. Here we
describe inside-outside and the accompanying backpropagation algorithm over a
hypergraph. This more general treatment shows the applicability of our method to
other structured factors such as for CNF parsing, HMM forward-backward, etc. In
the case of dependency parsing, the structure of the hypergraph is given by the
dynamic programming algorithm of Eisner (1996).

For the forward pass of the inside-outside module, the input variables are the
hyperedge weights $w_e, \forall e$ and the outputs are the marginal probabilities
$p_w(i), \forall i$ of each node $i$ in the hypergraph. The latter are a function
of the inside $\beta_i$ and outside $\alpha_j$ probabilities. We initialize
$\alpha_{\text{root}} = 1$.

$$\beta_i = \sum_{e \in I(i)} w_e \prod_{j \in T(e)} \beta_j \tag{13}$$

$$\alpha_i = \sum_{e \in O(i)} w_e \, \alpha_{H(e)} \prod_{j \in T(e):\, j \neq i} \beta_j \tag{14}$$

$$p_w(i) = \alpha_i \beta_i / \beta_{\text{root}} \tag{15}$$

For each node $i$, we define the set of incoming edges $I(i)$ and outgoing edges
$O(i)$. The antecedents of the edge are $T(e)$, the parent of the edge is $H(e)$,
and its weight is $w_e$.
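The forward pass in Eqs. (13)-(15) can be sketched on a toy hypergraph as follows. The representation is an assumption made for illustration: hyperedges are (head, tails, weight) triples, nodes are numbered so that every tail index is smaller than its head index, and leaf nodes receive their inside score from a hyperedge with an empty tail set. None of these names come from the paper.

```python
def inside_outside(num_nodes, edges, root):
    """Return (beta, alpha, marginals) for an acyclic hypergraph.

    edges: list of (head, tails, weight) with every tail index < head index.
    """
    # Inside pass, Eq. (13): bottom-up, so tails are finished before heads.
    beta = [0.0] * num_nodes
    for head, tails, weight in sorted(edges, key=lambda e: e[0]):
        score = weight
        for t in tails:
            score *= beta[t]
        beta[head] += score

    # Outside pass, Eq. (14): top-down, so a head is finished before its tails.
    alpha = [0.0] * num_nodes
    alpha[root] = 1.0
    for head, tails, weight in sorted(edges, key=lambda e: -e[0]):
        for pos, j in enumerate(tails):
            score = weight * alpha[head]
            for other_pos, k in enumerate(tails):
                if other_pos != pos:
                    score *= beta[k]
            alpha[j] += score

    # Node marginals, Eq. (15).
    marginals = [alpha[i] * beta[i] / beta[root] for i in range(num_nodes)]
    return beta, alpha, marginals


# Nodes 0 and 1 are leaves; node 2 (the root) can be built from {0, 1} or {0}.
toy_edges = [
    (0, (), 0.5),
    (1, (), 2.0),
    (2, (0, 1), 1.0),
    (2, (0,), 3.0),
]
beta, alpha, marginals = inside_outside(3, toy_edges, root=2)
print(beta, alpha, marginals)
```

In this example the two derivations of the root have weights 1.0 and 1.5, and node 1 takes part only in the first, so its marginal is 1.0 / 2.5 = 0.4, while nodes 0 and 2 appear in every derivation.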

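Below we use the concise notation of an adjoint
$\eth y = \frac{\partial J}{\partial y}$, a derivative with respect to objective
$J$. For the backward pass through the inside-outside AD module, the inputs are
$\eth p_w(i), \forall i$ and the outputs are $\eth w_e, \forall e$. We also
compute the adjoints of the intermediate quantities $\eth\beta_j$,
$\eth\alpha_i$. We first compute $\eth\alpha_i$ bottom-up. Next $\eth\beta_j$ are
computed top-down. The adjoints $\eth w_e$ are then computed in any order.

$$\eth\alpha_i = \eth p_w(i) \frac{\partial p_w(i)}{\partial \alpha_i} + \sum_{e \in I(i)} \sum_{j \in T(e)} \eth\alpha_j \frac{\partial \alpha_j}{\partial \alpha_i} \tag{16}$$

$$\eth\beta_{\text{root}} = \sum_{i \neq \text{root}} \eth p_w(i) \frac{\partial p_w(i)}{\partial \beta_{\text{root}}} \tag{17}$$

$$\eth\beta_j = \eth p_w(j) \frac{\partial p_w(j)}{\partial \beta_j} + \sum_{e \in O(j)} \eth\beta_{H(e)} \frac{\partial \beta_{H(e)}}{\partial \beta_j} \tag{18}$$

$$\qquad + \sum_{e \in O(j)} \sum_{k \in T(e):\, k \neq j} \eth\alpha_k \frac{\partial \alpha_k}{\partial \beta_j} \qquad \forall j \neq \text{root} \tag{19}$$

$$\eth w_e = \eth\beta_{H(e)} \frac{\partial \beta_{H(e)}}{\partial w_e} + \sum_{j \in T(e)} \eth\alpha_j \frac{\partial \alpha_j}{\partial w_e} \tag{20}$$

Below, we show the partial derivatives required for the adjoint computations.

$$\frac{\partial p_w(i)}{\partial \alpha_i} = \beta_i / \beta_{\text{root}}, \qquad \frac{\partial p_w(i)}{\partial \beta_{\text{root}}} = -\alpha_i \beta_i / \beta_{\text{root}}^2, \qquad \frac{\partial p_w(i)}{\partial \beta_i} = \alpha_i / \beta_{\text{root}}$$

For some edge $e$, let $i = H(e)$ be the parent of the edge and $j, k \in T(e)$ be
among its antecedents.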
This backpropagation method is used for both Figure 2 C.2 and E.2.
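The backward pass can be sketched as a reverse sweep over the forward computation of Eqs. (13)-(15). The code below is an illustrative reconstruction rather than the authors’ implementation: it uses a hypothetical (head, tails, weight) edge list in which tails have smaller indices than heads, and it accumulates adjoints edge by edge instead of evaluating Eqs. (16)-(20) as closed-form sums. The toy objective at the end is likewise made up.

```python
def prod_except(values, skip):
    """Product of values, leaving out the positions listed in skip."""
    out = 1.0
    for pos, v in enumerate(values):
        if pos not in skip:
            out *= v
    return out


def forward(num_nodes, edges, root):
    """Eqs. (13)-(15); edges are (head, tails, weight) with tails < head."""
    beta = [0.0] * num_nodes
    for head, tails, w in sorted(edges, key=lambda e: e[0]):
        beta[head] += w * prod_except([beta[t] for t in tails], ())
    alpha = [0.0] * num_nodes
    alpha[root] = 1.0
    for head, tails, w in sorted(edges, key=lambda e: -e[0]):
        tail_betas = [beta[t] for t in tails]
        for pos, j in enumerate(tails):
            alpha[j] += w * alpha[head] * prod_except(tail_betas, (pos,))
    p = [alpha[i] * beta[i] / beta[root] for i in range(num_nodes)]
    return beta, alpha, p


def backward(num_nodes, edges, root, beta, alpha, d_p):
    """Adjoints of the edge weights, given adjoints d_p of the marginals."""
    d_alpha = [0.0] * num_nodes
    d_beta = [0.0] * num_nodes
    d_w = [0.0] * len(edges)
    # Reverse of Eq. (15): p_w(i) = alpha_i * beta_i / beta_root.
    for i in range(num_nodes):
        d_alpha[i] += d_p[i] * beta[i] / beta[root]
        d_beta[i] += d_p[i] * alpha[i] / beta[root]
        d_beta[root] -= d_p[i] * alpha[i] * beta[i] / beta[root] ** 2
    # Reverse of the outside pass: visit edges bottom-up (ascending head).
    for idx in sorted(range(len(edges)), key=lambda n: edges[n][0]):
        head, tails, w = edges[idx]
        tail_betas = [beta[t] for t in tails]
        for pos, j in enumerate(tails):
            g = d_alpha[j]
            rest = prod_except(tail_betas, (pos,))
            d_w[idx] += g * alpha[head] * rest
            d_alpha[head] += g * w * rest
            for q, k in enumerate(tails):
                if q != pos:
                    d_beta[k] += g * w * alpha[head] * prod_except(tail_betas, (pos, q))
    # Reverse of the inside pass: visit edges top-down (descending head).
    for idx in sorted(range(len(edges)), key=lambda n: -edges[n][0]):
        head, tails, w = edges[idx]
        tail_betas = [beta[t] for t in tails]
        g = d_beta[head]
        d_w[idx] += g * prod_except(tail_betas, ())
        for pos, j in enumerate(tails):
            d_beta[j] += g * w * prod_except(tail_betas, (pos,))
    return d_w


toy_edges = [(0, (), 0.5), (1, (), 2.0), (2, (0, 1), 1.0), (2, (0,), 3.0)]
beta, alpha, p = forward(3, toy_edges, root=2)
# Toy objective J = 0.3 * p[0] - 1.2 * p[1] + 0.7 * p[2].
print(backward(3, toy_edges, 2, beta, alpha, d_p=[0.3, -1.2, 0.7]))
```

A simple way to gain confidence in such code is to compare its output with finite differences of the objective, perturbing one edge weight at a time.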


4 Other Learning Settings 

Loss-aware Training with Exact Inference 

Backpropagating through inference, decoder, and 
loss need not be restricted to approximate inference 
algorithms. Li and Eisner (2009) optimize Bayes 
risk with exact inference on a hypergraph for 
machine translation. Each of our differentiable loss 
functions (§ 3.2) can also be coupled with exact 
inference. For a first-order parser, BP is exact. Yet, 
in place of modules (B), (C), and (D) in Figure 2, we 
can use a standard dynamic programming algorithm 
for dependency parsing, which is simply another 
instance of inside-outside on a hypergraph (§ 3.5). 
The exact marginals from inside-outside (15) are 
then fed forward into the decoder/loss module (E). 

Conditional and Surrogate Log-likelihood The standard approach to training is
conditional log-likelihood (CLL) maximization (Smith and Eisner, 2008), which
does not take inexact inference into account. When inference is exact, this
baseline computes the true gradient of CLL. When inference is approximate, this
baseline uses the approximate marginals from BP in place of their exact values in
the gradient. The literature refers to this approximation-unaware training method
as surrogate likelihood training since it returns the “wrong” model even under
the assumption of infinite training data (Wainwright, 2006). Despite this, the
surrogate likelihood objective is commonly used to train CRFs. CLL and
approximation-aware training are not mutually exclusive. Training a standard
factor graph with ERMA and a log-likelihood objective recovers CLL exactly
(Stoyanov et al., 2011).


5 Experiments 
5.1 Setup 

Features As the focus of this work is on a novel approach to training, we look to
prior work for model and feature design. We add $O(n^3)$ second-order grandparent
and arbitrary sibling factors as in Riedel and Smith (2010) and Martins et al.
(2010). We use standard feature sets for first-order (McDonald et al., 2005) and
second-order (Carreras, 2007) parsing. Following Rush and Petrov (2012), we also
include a version of each part-of-speech (POS) tag feature, with the coarse POS
tags from Petrov et al. (2012). We use feature hashing (Ganchev and Dredze, 2008;
Attenberg et al., 2009) and restrict to at most 20 million features. We leave the
incorporation of third-order features to future work.

Pruning To reduce the time spent on feature extraction, we enforce the
type-specific dependency length bounds from Eisner and Smith (2005) as used by
Rush and Petrov (2012): the maximum allowed dependency length for each tuple
(parent tag, child tag, direction) is given by the maximum observed length for
that tuple in the training data. Following Koo and Collins (2010), we train an
(exact) first-order model and for each token prune any parents for which the
marginal probability is less than 0.0001 times the maximum parent marginal for
that token. 8 On a per-token basis, we further restrict to the ten parents with
highest marginal probability as in Martins et al. (2009). The pruning model uses
a simpler feature set as in Rush and Petrov (2012).
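The two marginal-based pruning rules can be sketched for a single token as follows. The dictionary input format, the function name, and the tie-breaking behaviour are assumptions made for illustration; only the 0.0001 ratio and the limit of ten parents come from the text above.

```python
def prune_parents(parent_marginals, ratio=1e-4, max_parents=10):
    """Return the candidate parents of one token that survive pruning.

    parent_marginals: dict mapping a candidate parent index to its marginal
    probability under the first-order pruning model (hypothetical format).
    """
    best = max(parent_marginals.values())
    # Rule 1: drop parents whose marginal is below ratio * best marginal.
    kept = [(prob, head) for head, prob in parent_marginals.items()
            if prob >= ratio * best]
    # Rule 2: keep at most max_parents parents with the highest marginals.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [head for _, head in kept[:max_parents]]


print(prune_parents({0: 0.9, 1: 0.0999, 2: 0.00005}))
```

With a best marginal of 0.9, the threshold is 0.00009, so parent 2 is pruned while parents 0 and 1 are kept.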

Data We consider 19 languages from the CoNLL- 
2006 (Buchholz and Marsi, 2006) and CoNLL-2007 
(Nivre et al., 2007) Shared Tasks. We also convert 
the English Penn Treebank (PTB) (Marcus et al., 
1993) to dependencies using the head rules from Ya- 
mada and Matsumoto (2003) (PTB-YM). We evalu¬ 
ate unlabeled attachment accuracy (UAS) using gold 
POS tags for the CoNLL languages, and predicted 
tags from TurboTagger 9 for the PTB. Unlike most 
prior work, we hold out 10% of each CoNLL train¬ 
ing dataset as development data. 
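
For concreteness, UAS can be computed as in the sketch below. The
representation is an assumption for illustration: each sentence is a
list of head indices (0 for the root), and an optional mask marks tokens
to skip, since some evaluation conventions exclude punctuation.

```python
def uas(gold_heads, pred_heads, is_punct=None):
    """Unlabeled attachment score: percentage of scored tokens with the correct head."""
    if is_punct is None:
        is_punct = [False] * len(gold_heads)
    # Keep only the (gold, predicted) head pairs for tokens that are scored.
    scored = [(g, p) for g, p, skip in zip(gold_heads, pred_heads, is_punct) if not skip]
    correct = sum(1 for g, p in scored if g == p)
    return 100.0 * correct / len(scored)


gold = [2, 0, 2, 3]
pred = [2, 0, 2, 2]
print(uas(gold, pred))  # 3 of 4 heads are correct
```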

8 We expect this to be the least impactful of our approxima¬ 
tions: Koo and Collins (2010) report 99.92% oracle accuracy 
for English. 

9 
Figure 3: Speed-accuracy tradeoff of UAS vs. the number
of BP iterations for standard conditional likelihood
training (CLL) and our approximation-aware training with
either an L2 objective (L2) or a staged training of L2
followed by annealed risk (L2+AR). Note that the x-axis shows
the number of iterations used for both training and
testing. We use a 2nd-order model with Grand.+Sib. factors.

Some of the CoNLL languages contain nonprojective
edges. With the projectivity constraint, the
model assigns zero probability to such trees. For
approximation-aware training this is not a problem;
however, CLL training cannot handle such trees. For
CLL only, we projectivize the training trees following
Carreras (2007) by finding the maximum projective
spanning tree under an oracle model which
assigns score +1 to edges in the gold tree and 0 to
the others. We always evaluate on the nonprojective
trees for comparison with prior work.
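
To make the notion of a nonprojective tree concrete, the sketch below
tests whether any two edges of a tree cross. It only detects
nonprojectivity; it does not implement the maximum projective spanning
tree search used for projectivization. The head-list encoding (0 for an
artificial root placed to the left of the sentence) is an assumption of
this example.

```python
def is_projective(heads):
    """Return True if no two dependency edges cross.

    heads[i] is the head of token i + 1; 0 denotes the artificial root.
    """
    # Represent each edge by its (left, right) endpoints in the sentence.
    edges = [(min(h, c), max(h, c)) for c, h in enumerate(heads, start=1)]
    for a, b in edges:
        for c, d in edges:
            # Edges cross when the second starts strictly inside the first
            # and ends strictly outside it. Looping over ordered pairs
            # covers the mirrored case as well.
            if a < c < b < d:
                return False
    return True


print(is_projective([2, 0, 2]))     # edges 2->1, root->2, 2->3
print(is_projective([3, 4, 0, 3]))  # edges 3->1 and 4->2 cross
```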

Learning Settings We compare three learning set¬ 
tings. The first, our baseline, is conditional log- 
likelihood training (CLL) (§ 4). As is common 
in the literature, we conflate two distinct learning 
settings (conditional log-likelihood/surrogate log- 
likelihood) under the single name “CLL” allowing 
the inference method (exact/inexact) to differentiate 
them. The second learning setting is approximation- 
aware learning (§ 3) with either our L 2 distance 
objective (L 2 ) or our layer-wise training method 
(L 2 +AR) which takes the L 2 -trained model as an ini¬ 
tializer for our annealed risk (§ 3.2). The annealed 
risk objective requires an annealing schedule: over 
the course of training, we linearly anneal from initial 
temperature T = 0.1 to T = 0.0001, updating T at 
each iteration of stochastic optimization. The third 


Figure 4: UAS vs. the types of 2nd-order factors included
in the model for approximation-aware training and standard
conditional likelihood training. All models include
1st-order factors (Unary). The 2nd-order models include
grandparents (Grand.), arbitrary siblings (Sib.), or both
(Grand.+Sib.), and use 4 iterations of BP.

uses the same two objectives, L2 and L2+AR, but
with exact inference (§ 4). The ℓ2-regularizer weight
is λ = […]. Each method is trained by AdaGrad for
10 epochs with early stopping (i.e., the model with
the highest score on dev data is returned). The learning
rate for each training run is dynamically tuned on
a sample of the training data.
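
The optimization recipe described above (a linear temperature schedule,
AdaGrad updates, and early stopping by dev score) can be sketched as
follows. This is an illustrative outline, not the training code used in
the experiments: the function names are hypothetical, the default
learning rate of 0.1 is an arbitrary placeholder (the text states only
that it is tuned per run), and the toy gradient stands in for the
gradient of the real objective.

```python
import math


def linear_temperature(step, total_steps, t_start=0.1, t_end=0.0001):
    """Temperature at a 0-based step on a linear schedule from t_start to t_end."""
    if total_steps <= 1:
        return t_end
    frac = step / (total_steps - 1)
    return (1.0 - frac) * t_start + frac * t_end


def adagrad_step(params, grads, sq_sums, learning_rate=0.1, eps=1e-8):
    """One in-place AdaGrad update; sq_sums accumulates squared gradients.

    learning_rate=0.1 is a placeholder assumption, not a value from the text.
    """
    for i, g in enumerate(grads):
        sq_sums[i] += g * g
        params[i] -= learning_rate * g / (math.sqrt(sq_sums[i]) + eps)


def best_epoch(dev_scores):
    """Early stopping as model selection: index of the epoch with the best dev score."""
    return max(range(len(dev_scores)), key=lambda e: dev_scores[e])


params, sq_sums = [1.0, -1.0], [0.0, 0.0]
total_steps = 5
for step in range(total_steps):
    temperature = linear_temperature(step, total_steps)
    # Toy gradient of sum(p^2); a real run would differentiate the
    # annealed risk at the current temperature instead.
    grads = [2.0 * p for p in params]
    adagrad_step(params, grads, sq_sums)
print(best_epoch([91.2, 91.8, 91.5]))
```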

5.2 Results 

Our goal is to demonstrate that our approximation- 
aware training method leads to improved parser ac¬ 
curacy as compared with the standard training ap¬ 
proach of conditional log-likelihood (CLL) maxi¬ 
mization (Smith and Eisner, 2008), which does not 
take inexact inference into account. The two key 
findings of our experiments are that our learning ap¬ 
proach is more robust to (1) decreasing the number 
of iterations of BP and (2) adding additional cycles 
to the factor graph in the form of higher-order fac¬ 
tors. In short: our approach leads to faster inference 
and creates opportunities for more accurate parsers. 

Speed-Accuracy Tradeoff Our first experiment is
on English dependencies. For PTB-YM,
Figure 3 shows accuracy as a function of the number
of BP iterations for our second-order model with
both arbitrary sibling and grandparent factors.
We find that our training methods (L2 and
L2+AR) obtain higher accuracy than standard training
(CLL), particularly when a small number of BP
iterations are used and the inference is a worse
approximation. Notice that with just two iterations of
BP, the parsers trained by our approach obtain ac¬ 
curacy equal to the CLL-trained parser with four 
iterations. Contrasting the two objectives for our
approximation-aware training, we find that our simple
L2 objective performs very well. In fact, in only
one case, at 6 iterations, does the additional annealed
risk (L2+AR) improve performance on test data. In
our development experiments, we also evaluated AR
without using L2 for initialization and found that
it performed worse than either CLL or L2 alone.
That AR performs only slightly better than L2 (and
not worse) in the case of L2+AR is likely due to early
stopping on dev data, which guards against selecting
a worse model.

Increasingly Cyclic Models Figure 4 contrasts 
accuracy with the type of 2nd-order factors (grand¬ 
parent, sibling, or both) included in the model for 
English, for a fixed budget of 4 BP iterations. As we
add higher-order factors, the model has
more loops, which makes the BP approximation
more problematic for standard CLL training. By
contrast, our training performs well even when the 
factor graphs have many cycles. 

Notice that our advantage is not restricted to the 
case of loopy graphs. Even when we use a 1st- 
order model, for which BP inference is exact, our 
approach yields higher accuracy parsers than CLL 
training. We postulate that this improvement comes 
from our choice of the L 2 objective function. Note 
the following subtle point: when inference is ex¬ 
act, the CLL estimator is actually a special case 
of our approximation-aware learner—that is, CLL 
computes the same gradient that our training by 
backpropagation would if we used log-likelihood as 
the objective. Despite its appealing theoretical
justification, the AR objective, which approaches empirical
risk minimization in the limit, consistently provides
no improvement over our L2 objective.
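
Since the discussion above attributes the gain to the L2 objective, a
small sketch of such an objective may help. It assumes the form
suggested by the paper's description, namely a squared distance between
each edge belief and the 0/1 indicator of whether that edge is in the
gold tree, summed over candidate edges; the exact normalization in § 3
may differ, and the function names and the (head, child) edge encoding
are hypothetical.

```python
def l2_objective(beliefs, gold_edges):
    """Sum over candidate edges of (belief - gold indicator)^2."""
    return sum(
        (b - (1.0 if e in gold_edges else 0.0)) ** 2 for e, b in beliefs.items()
    )


def l2_gradient(beliefs, gold_edges):
    """Partial derivative of the objective with respect to each belief."""
    return {
        e: 2.0 * (b - (1.0 if e in gold_edges else 0.0)) for e, b in beliefs.items()
    }


# Edges are (head, child) pairs; 0 is the root.
beliefs = {(0, 1): 0.75, (2, 1): 0.25, (1, 2): 0.5, (0, 2): 0.5}
gold = {(0, 1), (1, 2)}
print(l2_objective(beliefs, gold))
print(l2_gradient(beliefs, gold)[(0, 1)])
```

In approximation-aware training, gradients like these would be the
starting point for backpropagation through the BP iterations that
produced the beliefs; that step is not shown here.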

Exact Inference with Grandparents When our
factor graph includes unary and grandparent factors,
exact inference in O(n^4) time is possible using
the dynamic programming algorithm for Model
0 of Koo and Collins (2010). Table 1 compares four
parsers, by considering two training approaches and
two inference methods. The training approaches are
CLL and approximation-aware training with an L2


Train   Inference    Dev UAS   Test UAS
CLL     BP 4 iters   91.37     91.25
CLL     Exact        91.99     91.62
L2      BP 4 iters   91.83     91.63
L2      Exact        91.91     91.66


Table 1: The impact of exact vs. approximate inference 
on a 2nd-order model with grandparent factors only. Re¬ 
sults are for the development (§ 22) and test (§23) sec¬ 
tions of PTB-YM. 

objective. The inference methods are BP with only
four iterations or exact inference by dynamic
programming. On test UAS, we find that both the CLL
and L2 parsers with exact inference outperform
approximate inference, though the margin for CLL
is much larger. Surprisingly, our L2-trained parser,
which uses only 4 iterations of BP and O(n^3)
runtime, does just as well as CLL with exact
inference. Our L2 parser with exact inference performs
the best.
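
The size of the two margins can be read off Table 1 directly; the short
calculation below just makes the comparison explicit. The dictionary
layout and the function name exact_gain are illustrative choices, while
the numbers are the test UAS values from Table 1.

```python
# (training method, inference method) -> test UAS, copied from Table 1
test_uas = {
    ("CLL", "BP"): 91.25,
    ("CLL", "Exact"): 91.62,
    ("L2", "BP"): 91.63,
    ("L2", "Exact"): 91.66,
}


def exact_gain(train):
    """Test UAS gained by switching from 4 BP iterations to exact inference."""
    return round(test_uas[(train, "Exact")] - test_uas[(train, "BP")], 2)


print(exact_gain("CLL"), exact_gain("L2"))
```

The gain from exact inference is 0.37 UAS for the CLL-trained parser but
only 0.03 for the L2-trained parser.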

Other Languages Our final experiments evaluate 
our approximation-aware learning approach across 
19 languages from CoNLL-2006/2007 (Table 2). 
We find that, on average, approximation-aware 
training with an L 2 objective obtains higher UAS 
than CLL training. This result holds for both 1st-
and 2nd-order models with grandparent and sibling
factors with 1, 2, 4, or 8 iterations of BP. Table
2 also shows the improvement in UAS of
L2 over CLL training for each language as we vary
the maximum number of iterations of BP. We find
that approximation-aware training does not always
outperform CLL training on individual languages,
only in the aggregate. Again, we see the trend that our
training approach yields more significant gains when BP is
restricted to a small number of maximum iterations.
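
The aggregate claim can be checked against one column of Table 2. The
sketch below uses the L2 minus CLL differences for the 2nd-order model
with one BP iteration; the variable names are illustrative, and the
values are copied from the table in row order.

```python
# L2 - CLL differences in UAS, 2nd-order model, 1 BP iteration (Table 2).
# Row order: AR, BG, CA, CS, DA, DE, EL, EN, ES, EU,
#            HU, IT, JA, NL, PT, SL, SV, TR, ZH.
diffs = [2.21, -0.45, 0.17, 3.78, -1.07, 0.00, 0.29, 1.44, -0.37, 0.85,
         1.24, 0.04, 0.44, 2.08, -0.01, 1.50, -0.02, -0.64, 1.43]

mean_diff = sum(diffs) / len(diffs)
wins = sum(d > 0 for d in diffs)    # languages where L2 beats CLL
losses = sum(d < 0 for d in diffs)  # languages where CLL beats L2
print(round(mean_diff, 2), wins, losses)
```

For this column, L2 training helps on 12 of the 19 languages and hurts
on 6, with an average difference of +0.68, matching the Avg. row.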

6 Discussion 

The purpose of this work was to explore ERMA and 
related training methods for models which incorpo¬ 
rate structured factors. We applied these methods to 
a basic higher-order dependency parsing model, be¬ 
cause that was the simplest and first (Smith and Eis¬ 
ner, 2008) instance of structured BP. In future work, 
we hope to explore further models with structured 
factors—particularly those which jointly account for 
multiple linguistic strata (e.g. syntax, semantics, and 



          1st-order      2nd-order (with given num. BP iterations)
                         1              2              4              8
Language  CLL    L2-CLL  CLL    L2-CLL  CLL    L2-CLL  CLL    L2-CLL  CLL    L2-CLL
AR        77.63  -0.26   73.39  +2.21   77.05  -0.17   77.20  +0.02   77.16  -0.07
BG        90.38  -0.76   89.18  -0.45   90.44  +0.04   90.73  +0.25   90.63  -0.19
CA        90.47  +0.30   88.90  +0.17   90.79  +0.38   91.21  +0.78   91.49  +0.66
CS        84.69  -0.07   79.92  +3.78   82.08  +2.27   83.02  +2.94   81.60  +4.42
DA        87.15  -0.12   86.31  -1.07   87.41  +0.03   87.65  -0.11   87.68  -0.10
DE        88.55  +0.81   88.06   0.00   89.27  +0.46   89.85  -0.05   89.87  -0.07
EL        82.43  -0.54   80.02  +0.29   81.97  +0.09   82.49  -0.16   82.66  -0.04
EN        88.31  +0.32   85.53  +1.44   87.67  +1.82   88.63  +1.14   88.85  +0.96
ES        81.49  -0.09   79.08  -0.37   80.73  +0.14   81.75  -0.66   81.52  +0.02
EU        73.69  +0.11   71.45  +0.85   74.16  +0.24   74.92  -0.32   74.94  -0.38
HU        78.79  -0.52   76.46  +1.24   79.10  +0.03   79.07  +0.60   79.28  +0.31
IT        84.75  +0.32   84.14  +0.04   85.15  +0.01   85.66  -0.51   85.81  -0.59
JA        93.54  +0.19   93.01  +0.44   93.71  -0.10   93.75  -0.26   93.47  +0.07
NL        76.96  +0.53   74.23  +2.08   77.12  +0.53   78.03  -0.27   77.83  -0.09
PT        86.31  +0.38   85.68  -0.01   87.01  +0.29   87.34  +0.08   87.30  +0.17
SL        79.89  +0.30   78.42  +1.50   79.56  +1.02   80.91  +0.03   80.80  +0.34
SV        87.22  +0.60   86.14  -0.02   87.68  +0.74   88.01  +0.41   87.87  +0.37
TR        78.53  -0.30   77.43  -0.64   78.51  -1.04   78.80  -1.06   78.91  -1.13
ZH        84.93  -0.39   82.62  +1.43   84.27  +0.95   84.79  +0.68   84.77  +1.14
Avg.      83.98  +0.04   82.10  +0.68   83.88  +0.41   84.41  +0.19   84.34  +0.31


Table 2: Results on 19 languages from CoNLL-2006/2007. For languages appearing in both datasets, the 2006 version
was used, except for Chinese (ZH). Evaluation follows the 2006 conventions and excludes punctuation. We report
absolute UAS for the baseline (CLL) and the improvement in UAS for L2 over CLL (L2-CLL) with positive/negative
differences in blue/red. The average UAS and average difference across all languages (Avg.) are given.


topic). Another natural extension of this work is to 
explore other types of factors: here we considered 
only exponential-family potential functions (com¬ 
monly used in CRFs), but any differentiable function 
would be appropriate, such as a neural network. 

Our primary contribution is approximation-aware
training for structured BP. While our experiments
only consider dependency parsing, our approach is
applicable to any constraint factor which amounts
to running the inside-outside algorithm on a hypergraph.
Prior work has used this structured form
of BP to do dependency parsing (Smith and Eisner,
2008), CNF grammar parsing (Naradowsky
et al., 2012), TAG (Auli and Lopez, 2011), ITG
constraints for phrase extraction (Burkett and Klein,
2012), and graphical models over strings (Dreyer
and Eisner, 2009). Our training methods could be
applied to such tasks as well.

7 Conclusions 

We introduce a new approximation-aware learning 
framework for belief propagation with structured 
factors. We present two differentiable objectives: one for
empirical risk minimization (à la ERMA) and a
novel one based on the L2 distance between the inferred
beliefs and the true edge indicator functions.
Experiments on the English Penn Treebank and 19
languages from CoNLL-2006/2007 show that our
estimator is able to train more accurate dependency
parsers with fewer iterations of belief propagation
than standard conditional log-likelihood training, by
taking approximations into account. Our code is
available in a general-purpose library for structured
BP, hypergraphs, and backprop. 10